25 May 2016

Overview

Motivation
Four general principles
Case study
Costs and benefits

Motivation

Universalism

'Communism'

Disinterestedness

Organized skepticism

Origins of scientific skepticism

Robert Boyle's vacuum pump

Documentation

'Communal witnessing'

Circumstances

Stodden on modern skepticism

Empirical Reproducibility

Computational Reproducibility

Statistical Reproducibility

Peng: 'For every X there is a Computational X'

Computational Biology

Computational Physics

Computational Chemistry

Computational Economics

Computational …

Documentation breakdown

Computers are the new vacuum pump

Key ideas

Reproducibility is necessary for scientific progress
Computers wrangle all the data, but also obscure it
Especially point-and-click actions
Technical solutions available in open source/format/data/access

Four general principles of reproducible research that have emerged across the sciences

1. Make openly available the data and methods that generated the published results

✓ Plain text file formats

✓ Persistent URLs

Victoria Stodden's Reproducible Research Standard

✓ Data: CC-0 (public domain)

✓ Code: MIT (no liability for reuse)

✓ Text/Figures/Media: CC-BY (attribution required)

2. Write scripts to do everything

✗ Mouse gestures leave few traces that are enduring and accessible to others

✗ Easy to lose track of ad hoc changes in mouse-driven environments

✓ Scripts for data ingest, cleaning, analysis, visualizing, and reporting

✓ Scripts create a very high-resolution record of the research workflow in a plain text file that can be reused and inspected by others
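As a rough illustration of the second principle, the sketch below chains ingest, cleaning, analysis and reporting in one plain-text script. The data, the column names (`site`, `length_mm`) and the function names are all hypothetical; a real project would read a CSV file from disk rather than an inline string.

```python
import csv
import io
import statistics

# Hypothetical raw data; in a real project this would be a CSV file on disk.
RAW_CSV = """site,length_mm
A,12.5
A,
B,14.0
B,15.5
"""


def ingest(text):
    """Read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with a missing measurement and convert lengths to float."""
    return [
        {"site": r["site"], "length_mm": float(r["length_mm"])}
        for r in rows
        if r["length_mm"].strip()
    ]


def analyze(rows):
    """Compute the mean length per site, with sites in sorted order."""
    by_site = {}
    for r in rows:
        by_site.setdefault(r["site"], []).append(r["length_mm"])
    return {site: statistics.mean(vals) for site, vals in sorted(by_site.items())}


def report(summary):
    """Render the summary as CSV text, itself a plain-text artefact."""
    lines = ["site,mean_length_mm"]
    lines += [f"{site},{mean:.2f}" for site, mean in summary.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(report(analyze(clean(ingest(RAW_CSV)))))
```

Because every step is a function call in a text file, the decision to drop the incomplete row is visible to any reader, which a point-and-click deletion would not be.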

3. Use version control to track changes

✗ Managing different versions of computer files is very challenging

✗ Poor version control leads to losing track of the provenance of results

✓ Version control systems (VCS) designed for software engineering are suitable for research code and text

✓ Commit history preserves a high-resolution, transparent record of the development of a file or set of files

✓ Enables remote collaborators to work together without overwriting each other’s work
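One way to tie a result back to the commit that produced it is to stamp the output with the current commit hash. This is only a sketch and assumes the `git` command-line tool is installed and the script runs inside a repository; the helper names `current_commit` and `stamp` are hypothetical.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the current Git commit hash for repo_dir, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not a repository with commits.
        return None
    return result.stdout.strip()


def stamp(result_text, repo_dir="."):
    """Prefix a result with the commit that produced it."""
    commit = current_commit(repo_dir) or "unknown"
    return f"# generated at commit {commit}\n{result_text}"


if __name__ == "__main__":
    print(stamp("site,mean_length_mm"))
```

The stamp falls back to "unknown" rather than failing, so the same script still runs outside a repository.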

4. Describe and archive the computational environment

✗ Minor changes in software can cripple complex research pipelines

✗ Managing software dependencies is tedious

✓ List of the key pieces of software and their version numbers

✓ Archive a self-contained computational environment like a virtual machine or Linux container
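The first of these remedies, listing the key software and version numbers, can itself be scripted. The sketch below uses only the Python standard library; the function name `describe_environment` and the example package names are hypothetical, and it does not replace archiving a full virtual machine or container.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return interpreter, OS, and installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Package names below are examples only; list whatever the analysis imports.
    env = describe_environment(["pip", "numpy"])
    print(json.dumps(env, indent=2, sort_keys=True))
```

Saving this output as a plain-text file next to the results gives readers a record of what the analysis ran on.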

Case Study

First principle

All files on figshare.com

Data in CSV format

Organised as an R package

Second principle

R & Rmarkdown documents

Third principle

All files tracked with Git, hosted on GitHub

Collaboration did not occur on GitHub because no co-authors used it

Fourth principle

Docker image and Dockerfile to contain RStudio, packages, code and external dependencies

Based on Rocker image and templates

Smaller than a VM

Extreme isolation

Gentleman & Temple Lang (2004)

"research compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data,…), and as a means for distributing, managing and updating the collection."

Research compendium+

README.md
R package
Version control
Environment

Costs & benefits

Costs

Time learning the tools

That's all

Built-in vs Bolt-on

Benefits

Comfort of knowing that I am right & have no secrets

Save time by reusing my previous code

Open data confers citation advantages, but magnitude is highly variable

Open Source community membership provides access to high-quality help

Two implications: Training

Two implications: Incentives

Summary

Open methods and materials, scripted workflow, version control and environment control are generic principles suitable for most fields of research

The specific tools will change over time, but the principles will endure

For most people, the technical problems already have good solutions; the remaining challenge is cultural (e.g. syllabi and peer review)

Colophon